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A systolic algorithm rhythmically computes and passes data through a 
network of processors. We investigate the performance of systolic algo¬ 
rithms for implementing the gravitational IV-body problem on distributed- 
memory computers. Systolic algorithms minimize memory requirements 
by distributing the particles between processors. We show that the per¬ 
formance of systolic routines can be greatly enhanced by the use of non- 
blocking communication, which allows particle coordinates to be commu¬ 
nicated at the same time that force calculations are being carried out. The 
performance enhancement is particularly great when block sizes are small, 
i.e. when only a small fraction of the N particles need their forces com¬ 
puted in each time step. Hyper-systolic algorithms reduce the communica¬ 
tion complexity from O(Np), with p the processor number, to 0(N^/p), at 
the expense of increased memory demands. We describe a hyper-systolic 
algorithm that will work with a block time step algorithm and analyze its 
performance. As an example of an application requiring large N, we use 
the systolic algorithm to carry out direct-summation simulations using 10 6 
particles of the Brownian motion of the supermassive black hole at the 
center of the Milky Way galaxy. We predict a 3D random velocity of ~ 0.4 
km s' 1 for the black hole. 


1. INTRODUCTION 

Numerical algorithms for solving the gravitational IV-body problem have evolved 
along one of two lines in recent years. Direct-summation codes compute the com¬ 
plete set of N 2 interparticle forces at each time step; these codes were designed 
for systems in which the finite-IV graininess of the potential is important, and are 
limited by their 0(N 2 ) scaling to moderate (N < 10 5 ) particle numbers. The best- 
known examples are the NBQDY series of codes introduced by Aarseth [1], These 
codes typically use high-order schemes for integration of particle trajectories and 
avoid the force singularities at small interparticle separations either by softening, 
or by regularization of the equations of motion [2]. A second class of iV-body 
algorithms replace the direct summation of forces from distant particles by an ap- 

1 




2 


DORBAND, HEMSENDORF, MERRITT 


proximation scheme. Examples are tree codes [3], which reduce the number of direct 
force calculations by collecting particles in boxes, and algorithms which represent 
the large-scale potential via a truncated basis-set expansion (e.g. [4]) or on a grid 
(e.g. [5]). These algorithms have a milder, 0(N log N) scaling for the force calcu¬ 
lations and can handle much larger particle numbers although at some expense in 
decreased accuracy [6]. 

The efficiency of both sorts of algorithm can be considerably increased by the use 
of individual time steps for advancing particle positions, since many astrophysically 
interesting systems exhibit a “core-halo” structure characterized by different regions 
with widely disparate force gradients. An extreme example of a core-halo system is 
a galaxy containing a central black hole [7]. The efficiency of individual time steps 
compared with a global time step has rendered such schemes standard elements of 
direct-summation codes (e.g. [8]). 

Here we focus on direct-summation algorithms as implemented on multi-processor, 
distributed-memory machines. Applications for such codes include simulation of 
globular star clusters, galactic nuclei, or systems of planetesimals orbiting a star. 
In each of these cases, values of N exceeding 10 5 would be desirable and it is natu¬ 
ral to investigate parallel algorithms. There are two basic ways of implementing a 
parallel force computation for 0{N 2 ) problems. 

1. Replicated data algorithm. All of the particle information is copied onto 
each processor at every time step. Computing node i, 1 < i < p, computes the 
forces exerted by the entire set of N particles on the subset of n, = N/p particles 
assigned to it. 

2. Systolic algorithm. At the start of each time step, each computing node 
contains only N/p particles. The sub-arrays are shifted sequentially to the other 
nodes where the partial forces are computed and stored. After p— 1 such shifts, all 
of the force pairs have been computed. 

(The term “systolic algorithm” was coined by H. T. Kung [9, 10] by analogy with 
blood circulation.) Both types of algorithm exhibit an O(Np) scaling in communica¬ 
tion complexity and an 0(N 2 ) scaling in number of computations. The advantage 
of a systolic algorithm is its more efficient use of memory: since each processor 
stores only a fraction 1/p of the particles, the memory requirements are minimized 
and a larger number of particles can be integrated. Other advantages of systolic 
algorithms include elimination of global broadcasting, modular expansibility, and 
simple and regular data flows [10]. 

The performance of a systolic algorithm suffers, however, whenever the number of 
particles on which forces are being computed is less than the number of computing 
nodes. This is often the case in core-halo systems since only a fraction of the 
particles are advanced during a typical time step. As an extreme example, consider 
the use of a systolic algorithm to compute the total force on a single particle due 
to N other particles. Only one processor is active at a given time and the total 
computation time is 


Nt/ + p(n + t c ) 


(l) 


where t/ is the time for one force calculation, t; is the latency time required for two 
processors to establish a connection, and r c is the interprocessor communication 
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time. Thus the algorithm is essentially linear and no advantage is gained from 
having multiple processors. 

An efficient way to deal with the problem of small group sizes in systolic algo¬ 
rithms is via nonblocking communication , a feature of MPI that allows communi¬ 
cation to be put in the background so that the computing nodes can send/receive 
data and calculate at the same time [11]. In a nonblocking algorithm, the time per 
force loop for a single particle becomes 

Nrt 

—- + p{n +r c ). (2) 

p 

The second term is the waiting time for the last computing node to receive the 
particle after p shifts. The first term is the time then required to compute the 
forces from the subset of N/p particles associated with the last node. As long 
as the calculation time is not dominated by interprocessor communication, the 
speedup is roughly a factor of p compared with the blocking algorithm. 

Here we discuss the performance of systolic algorithms as applied to systems with 
small group sizes, i.e. systems in which the number of particles whose positions 
are advanced during a typical time step is a small fraction of the total. Section 
2 presents the block time step scheme and its implementation as a systolic algo¬ 
rithm. Section 3 discusses the factors which affect the algorithm’s performance, and 
Section 4 presents the results of performance tests on multiprocessor machines of 
blocking and nonblocking algorithms. Section 5 presents a preliminary discussion 
of “hyper-systolic” algorithms with block time steps, which achieve an 0(Ny/p) 
communication complexity at the cost of increased memory requirements. Finally, 
Section 6 describes an application of our systolic algorithm to a problem requiring 
the use of a large N: the gravitational Brownian motion of a supermassive black 
hole at the center of a galaxy. 


2. ALGORITHM 

In a direct-force code, the gravitational force acting on particle i is 


F, ; = mi&i 


N 

-Gnu Y, 

fc =i 


TOfc (iy - r fc ) 
|r» - r fc | 3 


( 3 ) 


where mi is the mass of the itli particle and r, its position; G is the gravitational 
force constant. The summation excludes k = i. 

The integration of particle orbits is based on the fourth-order Hermite scheme as 
described by Makino and Aarseth [12]. We adopt their formula for computing the 
time-step of an individual particle i at time i now , 


. _ / | a (Gow)||a^ 2 Hbiow)l + [a(tnow)|~ ,.S 

V |a(tn OW )||a( 3 )(t„ow)| + |a( 2 )(tnow)| 2 ' 

Here a is the acceleration of the ith particle, the superscript (j) denotes the j’th 
time derivative, and p is a dimensionless constant of order 10 -2 ; we typically set rj = 
0.02. With a definition of individual time-steps as in equation (4) the computing 
time for the integration of typical gravitational systems is significantly reduced in 
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comparison with codes which do a full force calculation at each integration step. 
Since the particle positions must be up-to-date in equation (3) they are predicted 
using a low order polynomial. This prediction takes less time if groups of particles 
request a new force calculation at large time intervals, rather than if single particles 
request it in small time intervals. For this reason, an integer n is chosen such that 


kr < - < a 


( 5 ) 


with A ti given by equation (4). The individual time step is replaced by a block 
time step A;where 


A bU = 



( 6 ) 


We implement the systolic force calculation in either blocking or nonblocking 
mode. For the nonblocking calculation, we have to add a buffer for storing the 
incoming positions and masses, one for the outgoing positions and masses, and one 
compute buffer. The maximal size of these buffers is defined by the largest parti¬ 
cle number on one processing element. We also need input, output and compute 
buffers for the forces and the time-derivatives of the forces. Since data synchroniza¬ 
tion is not critical with blocking communication, extra input data buffering is not 
necessary. A positive side effect of the buffering strategy is that data access is far 
more ordered than in other implementations of the Hermite scheme. As a result, 
the number of cache misses is reduced, optimizing performance. 

We arrange all processors in a ring-like structure, so that each processor has a 
right and a left neighbor. For the integrator to work, the individual block time- 
step A bti and the time of the last integration hast must be defined. Either the 
initialization or the integration method are required to compute these two quanities. 


Algorithm 1 (Find new group). 

Search all particles i for the smallest f m i n , p = A bU tP + hast, i,p 
on each computing node p. 

Do a global reduce so that each processor knows the global 
minimum t m ; n = min(hnin,p)- Set the simulated 
time to f n0 w = ^min- 

Find the particles with A f,h,p + hasty, p = t now 
and store their index i in a list. 

Predict the positions and velocities at the time 
how for the local particles on each node. 

Each processor will select a subgroup s p of the block size s = s p . The systolic 
shift with blocking communication is implemented as follows: 

Algorithm 2 (force loop (blocking communication)). 

Copy the positions of the subgroups s p into the 
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Nonblocking systolic force computation: Ideal calculation dominated case 

from PE4 V from PE4 S N from PE4 N s 



FIG. 1. Workflow diagram for an ideal, calculation-dominated nonblocking systolic force 
computation. Time increases to the right; the bold-faced arrows represent the work of each of the 
computing nodes, assumed here to be four. The dashed arrows indicate the flow of the position 
information between the nodes. Circles indicate the points when each processor switches from 
computing the partial forces from one subgroup to the next subgroup, including the time to finalize 
one communication thread and initialize the next one. Vertical lines represent barriers which can 
only be passed when all processes reach the same state of execution. 
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compute buffer on each node, 
foreach j in s p 

Do the force calculation with respect to all local particles. 

end foreach 

Wait for all processors to finish their work. 

Copy the masses, positions, and partial forces to the output 
buffer. 

Send the output buffer to the right neighbor and store the 
data from the left neighbor in the compute buffer. 

Continue at line 0 unless p shifts have 

been done and the forces returned to their originating processor. 

Utilizing nonblocking communication, the systolic loop can gain significant per¬ 
formance since it is possible to transfer data while the force calculation is ongoing. 
The partial forces follow one cycle behind the positions. 

Algorithm 3 (force loop (nonblocking communication)). 

Copy the positions of the subgroups s p into the 
compute buffer and into the output buffer on each node. 

if nloops between 1 and p — 1: 

Initiate the data transfer for the positions and masses to 
the right node and allow data to be sent from the left neighbor 
to the input buffer. 

end if 

if nloops between 2 and p: 

Initiate the data transfer for the forces to the right node 
and allow data to be sent from the left neighbor to the input buffer. 

end if 

foreach j in s p 

Do the force calculation with respect to all local particles. 

end foreach 

if nloops between 1 and p — 1: 

Wait for the data transfer of particles to finish. 

end if 

if nloops between 2 and p: 

Wait for the data transfer of the forces to finish. 

Add the partial forces to the incoming forces. 

end if 

Increment nloops and continue at line 0 
unless p shifts have been made. 

Shift the forces to the right neighbor. 
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Figure 1 shows an idealized workgraph of the nonblocking systolic force computa¬ 
tion, assumed to be completely calculation-dominated, which would be the case for 
a computer system with zero latency and infinite bandwidth. In the example shown 
in Figure 1, processor one finds si = 30, processor two S 2 = 10, processor three 
s 3 = 40, and processor four S 4 = 20 particles due for the force calculation. The 
thin vertical lines represent barriers which all processes can only pass together. The 
circles represent points in time at which a process changes from the computation of 
subgroup Si to the next subgroup. Since all communication is in the background, 
switching between the subcalculations is very fast and the processing elements do 
not idle. The dashed arrows represent the data flow between processors; they be¬ 
gin at the sending point and terminate where the reception is finalized. In Figure 
1 , communication between processing elements one and two must be carried out 
in a very short time as indicated by the steep inclination of the arrows. This is 
why, of all communications in the force loop, the transfer of S 3 between these two 
processors requires the highest bandwidth. In the ideal case, however, all processes 
finalize the force calculation at the same time as is shown by the second vertical 
bar. We discuss in the following section how closely we can reach this ideal in 
typical applications. 

3. FACTORS AFFECTING THE PERFORMANCE 
3.1. General performance aspects 

The performance of systolic force calculations is dependent on the calculation 
speed of each processing element and on the bandwidth of the inter-process com¬ 
munication network. Also important is the parallelism of the calculation and the 
load imbalance. However, predicting these latter two quantities requires precise 
knowledge of the distribution of work on the nodes and of numerous machine pa¬ 
rameters. Some of these parameters might also be dependent on the overall usage 
of the parallel computer. For these reasons we restrict our estimates to an optimal 
and a worst-case scenario. 

Let N be the total number of particles and s the number of particles whose 
coordinates are to be advanced during the current integration step. The particles 
to be integrated are distributed over p subgroups of size Sj, 1 < i < p, such that 
the ith processor contributes s, particles. Thus 

p 

Y, Si=s ' ( ? ) 

*=1 



Note that the s, need not be equal, since the number of particles due to be inte¬ 
grated may be different on different processors. We assume that the time spent on 
force calculations is a linear function of the group size Sj, and similarly that the 
communication time is a linear function of the amount of transported data. 

The time required for the force calculation can be estimated as follows. Let r/ 
be the time required to do one pairwise force calculation. The total number of 
force calculations that one processor has to do per full force loop is ( N/p)s , since 
the processor calculates the force of its ( N/p ) own particles against the s particles 
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of the group. The total time for one full force loop in the calculation-dominated 
regime is therefore 


tf = — T f- ( 9 ) 

The time necessary for communicating a subset of particles s, from one processor 
to its neighboring one can be estimated as follows. Let r c be the time required to 
transfer the information for one particle to the next processor. The latency time n 
is the time it takes to set up the communication between two processors. Within 
a single integration step, each processor has to establish p connections and has to 
send s particles. The total communication time is therefore: 

t c = pn + st c . (10) 


3.2. Blocking communication 

If the communication is not delegated to a communication thread, the force 
calculation phase for each subgroup .s, is followed by a communication phase. We 
first consider the case that all group sizes Si are the same. Then the total time is 
just the sum of tf and t c , or 


Ns 

t = - Tf + pn + ST C . 

p 


The optimal processor number is obtained when dt/dp = 0, or 


( 11 ) 


Popt - 



( 12 ) 


and 


t op t = st c + 2^/TfTlNs. 


(13) 


We now consider the case that the block sizes s,; are not equal. Each shift 
of particles must now wait for the processor with the largest block size s max to 
complete its calculation and communication. So for each shift, the time t s is 


N 

ts — Smax'Tf t t t ^maxN 
p 


(14) 


and after p shifts, 


t — NSmaxTf + pT[ pS max T c . 


(15) 


Defining 


Smax 


{Si) + A rnaxSi 

S A 

“ 1 “ ^maxSii 


V 


(16) 
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we can write: 


t 


Ns 


Tf + PV + ST C + NA max SiTf + pA 


max°i 1 c • 


Setting dt/dp = 0, 


(17) 


Popt — 


TfNs 


Tl + T c A„ 


(18) 


and the time is 


t 0 pt = 2 ^ N st f {r t + T c A max s.i) + st c + NA r 


S t Tf. 


(19) 


3.3. Nonblocking communication 

On a system which allows concurrent communication and calculation, the systolic 
algorithm can become more efficient, since the effective cost of communication is 
reduced. In this case the communication routines return immediately after initial¬ 
izing the communication process. As described in more detail in section 2 , each 
processor has to perform the following steps p times for each integration step: 

1. Do the force calculation for the Si particles of one subgroup. 

2. Simultaneously send the s,; particles to the next processor. 

3. Simultaneously receive the Sj_i particles from the previous processor. 

Each of these steps starts at the same time, but they might take different amounts 
of time to finish. Since (2) and (3) behave in a quite similar way, we treat them 
together and call them communication. Step (1) is called calculation. A sys¬ 
tem in which the calculation takes more time than the communication is called 
a calculation-dominated system , and a system in which communication is dominant 
we call a communication-dominated system. Our goal is to derive approximate ex¬ 
pressions for the total time, calculation plus communication, per integration step 
and to minimize this time. 

Figure 2 shows a communication-dominated system. In this example, r; = 0 
and t c = Tf. With this communication speed and the very uneven distribution 
of subgroups Si on the processors, it is not possible to ensure a continuous force 
calculation. The time to compute the forces is dominated by the communication 
of the largest group (s 3 in Figure 2). With this scenario we are able to define 
a threshold for the communication speed, which ensures that processors are not 
waiting for the communication to finish. Let t Ctma x be the processor-to-processor 
communication time for the largest subgroup s max and t, f tTn i n the force calculation 
time for the smallest subgroup s m i n . Continuous calculation can be expected when 

tc,max ~ (20) 

The values for the latency time and the throughput are dependent on the specific 
MPI implementation. Cray T3Es typically have a two-mode implementation with 
algorithms for small and for large messages. On the T3E-900 in Stuttgart, the 
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Nonblocking systolic force computation: Non-ideal case 

' 2 ^ from PE4 from PE4 'a ' 2 ^ from PE4 

PEI, si =30 


PE2, s2=10 


PE3, s3=40 


PE4, s4=20 


"a 



FIG. 2. Workflow diagram for a mixed communication- and calculation-dominated systolic 
force calculation. In this example we have chosen the communication time to be a linear function 
of the calculation time with t Cy i = tf^ and 77 = 0. The hatched blocks symbolize times when 
the processors perform the force calculations. The circles at the lower ends of these blocks denote 
times when the communication of positional data is initiated. The ones at the upper end denote 
the times when the communication has completed. The dashed arrows show the dataflow. In 
order to simplify this graph, the communication of partial forces is omitted. 
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sustained average latency and throughput has been measured to be 77 = 6 ps 
for the routines MPI_Send and MPI_Recv [13]. The MPI implementation on this 
machine actually has a four-mode scheme which provides the following measured 
bandwidths (3: 


/3r 

> 

220 Mbyte/sec 

for 

L-rn 

> 

8 kbyte, 

P2 

> 

300 Mbyte/sec 

for 

Lm 

> 

64 kbyte, 

Ps 

> 

315 Mbyte/sec 

for 

Lm 

> 

256 kbyte. 


The quantity L m denotes the message length. These figures show that the actual 
throughput is dependent on the machine architecture, load and MPI implementa¬ 
tion [13]. 

Assuming all s, to be equal, a rough estimate for an optimal processor number 
can be given. Figure 3 shows that there is an optimum value p = p op t which 
minimizes the total time in this case. We find p opt by setting (9) equal to (10): 


Popt — 


— st c + \J s 2 t 2 + Atit/Ns 

2 n 


and the total time is 

topt = + l\J s2r c +^ T f Ns - 


( 21 ) 


( 22 ) 


Equations 22 and 21 become far simpler on machines having zero latency. With 
77 = 0 we find 

Popt = N T -l, (23) 

T c 

fopt = st c . (24) 


This means that the systolic force loop parallelizes extremely well on such an ideal¬ 
ized computer so that the computing time is only dominated by the performance of 
the intercommunication network. However this equation also allows the use of more 
processors than particles if the communication is faster than the force calculation. 
We give a more detailed picture of the situation in the following. 

In iV-body systems with a low central concentration, i.e. small core density, s 
is usually proportional to N 2 / 3 [12]. In our benchmarks discussed below, however, 
we find power-law indices of 0.55 ± 0.05 for the Plummer model, 0.437 ± 0.105 for 
the Dehnen model, and 0.492 ± 0.110 for the Dehnen model with a black hole. For 
this reason, we estimate the groupsizes to be s ss ^iV 1 / 2 . The constant £ can be 
derived for each type of dataset from the measured values for s in Table 1. 

In our scheme, a given processor may be calculation- or communication-dominated 
depending on the value of s t that is currently assigned to it. We can compute the 
time per integration loop in this more general case by again focussing on just one 
processor, which carries out p operations of force calculation and communication 
per loop. Each of these p operations will be either calculation- or communication- 
dominated, requiring either a time of tf^ or t Ci j. With 1 < i < p, 

N 

t f.i — 'ATf 1 
p 


(25) 
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0 P 

FIG. 3. Communication and calculation time as a function of processor number p for fixed 
N as derived from equations (9) and (10). This figure is only valid under the assumption of equal 
Si- 


t c ,i — 1~i + SiT c . 


(26) 


With the notion of the threshold in equation (20), which guarantees a continuous 
calculation phase, we have a completely calculation-dominated system. Thus 


p 


A ale = £ t fti , 

(27) 

L -J. 

N . . 

= - STf. 

V 

(28) 


It is not trivial, though, to define an optimal processor number, since the distribu¬ 
tion of the Si and the size of s is a dynamical quantity of the integrated particle set. 
The result from equation (28) describes the ideal scaling for the force calculation 
which we use below as a basis of comparison with the benchmark data. 

If the threshold (20) is not fulfilled, a significant share of the computation time t 
comes from the communication. Assuming a worst-case scenario, where all particles 
of the group s are found on only one processor, the computation time becomes 
maximal: 


t 


comm 


t f , max + £ t c , max , 


tf,max pic,r 


(29) 

(30) 
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The overall calculation time for the situation shown in Figure 2 follows equation 
(30). We use this equation below to estimate the minimal scaling behavior expected 
from our code. 

3.4. Comparing the performance of the two methods 

In comparison with the nonblocking algorithm, the blocking scheme does not 
fall behind in terms of the parallelism. However, as long as the force computations 
remain calculation-dominated in the nonblocking scheme, minor variations between 
the Si can be levelled out so that perfect load balancing is guaranteed. The blocking 
scheme is not flexible in this regard so that it builds up a penalty of the order 
?iA max s,;r/ per force loop. Since equation (17) shows that there are a few particles 
in the blocking scheme which are treated with a serial performance, the overall 
parallel efficiency is reduced. This means that a systolic loop will only give satisfying 
performance when load balancing is guaranteed either on a per processor level by 
keeping all s- t the same, or by applying nonblocking communication. This is why a 
nonblocking communication scheme is superior to the blocking one. 

Depending on the type of system integrated, the number of available processors, 
the performance of the computer, and the total number of particles, the work 
load for the processors might become very small. Small group sizes are less likely 
to be distributed evenly on the processes; we expect an approximately Poisson 
distribution, or 


A max^i ~ y/Ti. (31) 

For the systolic algorithm, large problem sizes, together with group sizes that are 
larger in the mean than the number of available processors, ensure efficient parallel 
performance with a systolic algorithm. In this case one can expect an ideal linear 
scaling of the performance with processor number. For small problem sizes, the 
amount of communication governs the the performance. 

3.5. More elaborate codes 

Parallelizing an individual block time step scheme is not trivial, as the discussion 
above has shown. However, is has also been shown that a fairly simple scheme 
can profit enormously from threaded, and therefore nonblocking, communication. 
Since the overall communication time is a linear function of the processor number in 
our simple scheme, attempts have been made to reduce the number of communica¬ 
tions. The hypersystolic codes proposed by Lippert et al. [14, 15] and the broadcast 
method proposed by Makino [16] can perform the force calculation by using only 
y/p communication events. However, both methods require identical group sizes on 
each computing node in order to have efficient load balancing. 

Individual block timestep schemes select their particles to be integrated under 
physical criteria defined by the simulated system. This is why a load imbalance 
is unavoidable in general. Assuming there is no extra algorithm that provides a 
perfect distribution of the group, the expected imbalance time would be dependent 
on the particle distribution statistics in equation (31). Thus 

nA max Sj 

Hmb — 7= 7*/ 

Vp 


(32) 
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This implies that the nonblocking systolic code will outperform nonbalanced hy- 
persystolic or broadcast methods unless the processor number is very large. 

4. MEASURED PERFORMANCE 

We evaluated the performance of the algorithms in realistic applications by using 
them to evolve particle models of spherical stellar systems. The evolution was 
carried out for approximately one crossing time, and the benchmarks were based 
on timings for 2000 integration steps. The Hermite integration scheme adjusts the 
group sizes automatically as described above. All experiments were carried out on 
the Cray T3E at the Goddard Space Flight Center. We compiled the executable 
from our C sources using the Cray standard C compiler version 6 .2.0.0 with no 
explicit optimization. 


4.1. Initial conditions 

We consider three models representing spherical stellar systems in equilibrium. 
The Plummer model [17] has mass density and gravitational potential 


p{r) 

4>(r) 


3 GM b 2 
4?r ( r 2 + 52 ) 5/2 ’ 

GM 

\/r 2 + b 2 


(33) 

(34) 


Here M is the total mass and b is a scale length. The many analytic properties 
of this model make it a common test case for benchmarking. For the present 
application, the most important feature of the Plummer model is its low degree of 
central concentration and its finite central density, similar to the density profiles 
of globular star clusters. The Dehnen family of models [18] are characterized by a 
parameter 7 that measures the degree of central concentration. The density profile 
is 


_ (3 - 7 )M a 

P T 4tt r i (r + a ) 4-7 


(35) 


with M the total mass as before and a the scale length. We chose a centrally 
condensed Dehnen model with 7 = 2 , yielding a density profile similar to those of 
elliptical galaxies with dense stellar nuclei. The gravitational potential for 7 = 2 is 


4>(r) = 


GM 

a 


In 


- 2 _). 

r + a J 


(36) 


The central density diverges as r ~ 2 and the gravitational force diverges as r _1 . Our 
third model had a density equal to that of (35) with 7 = 2 and an additional mass 
component consisting of a point particle at the center representing a supermassive 
black hole. The mass M, of the “black hole” was 1% of the stellar mass of the 
model. This is similar to the black-hole-to-galaxy mass ratios observed in real 
galaxies [19]. 

The initial particle velocities for the TV-body realizations of our models were se¬ 
lected from the unique, isotropic velocity distribution functions that reproduce 
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TABLE 1 

Mean Group Sizes for the Benchmarks 


Model 


131072 

65536 

32768 

16384 

Plummer 


172 

128 

89 

54 

Dehnen, M. 

= 0 

13.1 

11.2 

9.2 

5.1 

Dehnen, M, 

= 0.01 

6.7 

4.1 

2.6 

2.5 


p(r) in the corresponding potentials $(r). For all of the models we adopted 
units in which the gravitational constant G and the total mass of the the sys¬ 
tem M were set to unity. The scale length a of the Dehnen model was also 
set to unity, while the Plummer radius b was chosen to be 37r/16. We com¬ 
puted two realizations for each particle number N. The total number of stars 
was N = (16384,32768,65536,131072). 

After the integration over 2000 steps we computed the average group size during 
the benchmark. Table 1 summarizes the results. The centrally condensed Dehnen 
model systems require the use of far smaller groups than the Plummer model. This 
poses a severe challenge to systolic algorithms. 


4.2. Performance of the force calculation 

Since the complexity of the force calculation is 0(N 2 ), compared to O(N) for 
the integrator, it is critical to measure the impact of parallelism on this task. 
We present comparisons of blocking and nonblocking systolic codes on the three 
datasets described above. We count 70 floating point operations per pairwise force 
calculation in our implementation. 

The particle blocks may have very different sizes, and accordingly the efficiency 
will vary from force step to force step. For this reason we measure performance by 
summing the time required to carry out all force computations in a time interval 
corresponding to 2000 steps, then averaging. 

Figures 4-6 show the measured speedups, expressed in terms of the number of 
force calculations per second. In the case of the nonblocking algorithm, the expec¬ 
tation from equation (28), assuming a calculation-dominated system, is 


1 

^calc CK , 

P 


(37) 


corresponding to a linear trend in Figures 4-6. This dependence is recovered for 
each of the three models when N is large. In the Dehnen-model runs with small 
N, the force calculation becomes communication-dominated due to the small group 
sizes (Table 1) and the performance is described by equation (30): 


t 


comm 


OC p 


(38) 


corresponding to a 1/p dependence of the speedup in Figures 4-6. This effect is less 
pronounced for the Plummer models since the mean group size is larger (Table 1). 
The effect is strongest for the Dehnen models with “black hole” since they have the 
smallest group sizes (Table 1). 

In comparison with the runs involving Dehnen models, the Plummer model runs 
show a peculiar behavior. For very low processor number, the performance in- 
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FIG. 4. Results of the benchmarks for the Plummer model initial conditions. The number 
of pairwise force calculations per second has been plotted as a function of processor number NPE. 
Non-blocking and blocking algorithms are indicated via filled and open symbols respectively, for 
the four different particle numbers N listed in Table 1. The dashed line shows the expected 
performance in the case of an ideal, calculation-dominated system (linear speedup). 


creases linearly. However for an intermediate number of computing nodes, the 
performance shows a swper-linear gain until the curve becomes shallow for large 
processor numbers. This super-linearity is a result of the way in which message 
passing is implemented on the T3E, which switches to a different protocol for small 
messages. The mean message size is inversely proportional to the total number of 
processes so that we can profit from this change. As Table 1 shows, the mean group 
size (which is proportional to the mean message size) is significantly smaller for the 
Dehnen models explaining why the super-linearity does not occur. 

The open symbols in Figures 4-6 show the results for the blocking systolic algo¬ 
rithm. In the Plummer model runs, the blocking code managed to calculate a little 
more than half the number of force pairs per second compared with the nonblocking 
code. The performance difference is much more dramatic in the Dehnen model runs 
due to the small group sizes: the blocking code can be as much as a factor of ten 
slower than the non-blocking code. The computing time is asymptotically constant 
in the Dehnen model runs since the force calculation becomes increasingly serial 
for small s, as given by equation (17). For very large N we expect the 1/p scaling 
to take over. This does not appear in the plots because of the inefficiency of the 
blocking force calculation for small group sizes. 

4.3. Performance of the integration 

The performance of the fV-body code depends also on the efficiency of the algo¬ 
rithm that advances the particle positions, which we call the “integrator.” Here we 
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Dehnen model, y = 2.0, M. = 0.0 



FIG. 5. Same as Figure 4, for the Dehnen model initial conditions. The dashed and 
solid lines show the expected scaling for calculation- and communication-dominated systems. The 
performance of the non-blocking algorithm is reduced for small N due to the small group sizes. 


Dehnen model, y= 2.0, M. = 0.01 



NPE 


FIG. 6. Same as Figure 5, for the Dehnen model with a central point particle (“black 
hole”). The influence of small group sizes is even more striking. 
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TABLE 2 

Integrated time after 2000 force loops for the datasets used in 

the benchmarks 

Dataset I 1 2 


Plummer 

16384 


4.57 • 

10 -2 

5.40- 

10 -2 

32768 


3.35 • 

10 -2 

3.08- 

10~ 2 

65536 


1.90- 

10“ 2 

2.09- 

10~ 2 

131072 


1.08- 

10“ 2 

9.95- 

10 -3 

7 = 2.0, M. 
16385 

= 0.0 

1.75 • 

10“ 3 

1.59- 

10~ 3 

32769 


1.08- 

10 -3 

1.35- 

10~ 3 

65537 


5.35 • 

10“ 4 

6 .68- 

10^ 4 

131073 


4.11 • 

10“ 4 

3.26- 

10^ 4 

7 = 2.0, M. 
16385 

= 0.01 

3.99- 

10“ 4 

2.63- 

10^ 4 

32769 


1.53- 

10“ 4 

1.49 • 

10^ 4 

65537 


1.30- 

10“ 4 

1.36- 

10^ 4 

131073 


4.52 • 

10 -5 

8.51 • 

10- 5 


give benchmarks for the full code including the integrator. Our performance goal is 
to reach a linear increase of computer wall-clock time as a function of problem size 
for a constant simulation interval and optimal processor number. The unit of time 
in the simulations is fixed by the gravitational constant (which is set to G = 1), the 
total mass, and the adopted length scale via [T] = (GM/f? 3 ) -1 / 2 [20]. Our perfor¬ 
mance results are based on an integration using 2000 force loops. The integrated 
time in model units is summarized in Table 2. We carried out two integrations for 
each model based on different random realizations of the initial conditions. 

The performance of the blocking scheme is shown in Figure 7, for the same values 
of N and NPE shown in Figures 4-6. We now plot performance as a function of N; 
runs with the same N but different NPE are plotted with the same symbol and their 
vertical scatter is an indication of how well the code benefits from parallelism. In 
the blocking scheme, the points are very localized vertically revealing that the code 
can not profit much from large processor numbers. Larger problem sizes, however, 
are able to optimize the computations in such a way that the scale-up is better than 
quadratic, the scaling in traditional direct-force codes. With the individual time 
step scheme described here we expect a behavior of Ns. Since we observe an IV 1 / 2 
behavior for the mean group size, we can expect a scale-up following a TV 3 / 2 power 
law in our benchmarks. We compare the maximal performance with increasing 
work load in figures 7 and 8 for our code which shows a scale up between 0(N 2 ) 
and 0(7V 3 / 2 ) on a serial machine. In our benchmarks we consider any scaling better 
than that as indicating a benefit from parallel computing. 

In order to quantify the scaleup, we fit the two, best-performing runs for each 
data set. We fit for a and b in the following function: 

t N b 

ywc _ Jv tot 

T 


a 


(39) 
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FIG. 7. Wall clock time in seconds per integrated model unit as a function of particle 
number N for the blocking Af-body code. Different points at given N show the performance for 
the various values of NPE of Figures 4-6. For comparison we plot lines following a linear and a 
quadratic TV-dependence. The lines are positioned in such a way that they mark the maximal 
speed of the nonblocking code. 
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FIG. 8. Same as Figure 7 for the code with nonblocking communication. 
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TABLE 3 

Power-law fits of the two best performing runs in each benchmark 
for the blocking and nonblocking communication runs. 


Dataset 

| blocking, b 

A b 

| nonblocking, b 

A b 

Plummer 


1.65 

0.03 

1.174 

0.112 

7 = 2.0, M. 

= 0.0 

1.702 

0.006 

1.16 

0.03 

7 = 2.0, M. 

= 0.01 

1.612 

0.003 

0.9 

0.1 


The quantity T the integrated time in model units and t wc is the wall clock time 
for the run. Table 3 summarizes the results for b. Since the benchmarks using the 
Plummer model with the nonblocking code did not reach their maximum speedup 
for N = 131072 and 65536, we estimated the scaleup from the runs with smaller 
particle sets only. These values are italicized in Table 3. 

When measuring the overall performance of the integrator, the nonblocking com¬ 
munication scheme makes much better use of the T3E than the blocking scheme: 
the broad vertical bands in Figure 8 show that the integration speed profits from 
increasing processor number. The two lines representing a linear and a quadratic 
behavior are placed at the optimal speed levels for the runs integrating a Plummer 
model. For all three types of model, a nearly linear scaleup can be observed. The 
exceptions are the datasets representing a Plummer model with 131072 and 65536 
particles. The reason for this is no bottleneck. In fact, the nonblocking code scales 
so well that we could not reach the optimal performance with our maximal number 
of 128 processing elements in the benchmarks. 

4.4. Nonblocking vs. Blocking Communication 

While the implementation of a systolic force calculation is simpler and more 
memory-efficient using blocking communication, we have found significant perfor¬ 
mance gains using a nonblocking algorithm. The results of our force calculation 
benchmarks, displayed in Figures 4-6, show up to ten times better performance for 
a code implementing a nonblocking systolic force calculation over one which ap¬ 
plies a blocking scheme. The gain is greatest when the typical particle group size is 
smallest. The integration as a whole can profit more from the parallelization in the 
nonblocking scheme than in the blocking: we observe a scaleup close to IV 3//2 for 
the blocking scheme, while the nonblocking scheme clearly reaches a linear scaling. 
This means that the nonblocking scheme is actually able to reduce the complexity 
of the direct force integration to O(N). 

5. HYPERSYSTOLIC FORCE CALCULATION WITH 
INDIVIDUAL BLOCK TIME STEPS 
5.1. Systolic and hyper-systolic matrices 

Introduced by Lippert et al. [14, 15], the hyper-systolic algorithm reduces com¬ 
munication costs in 0(N 2 ) problems by improving the communication complexity 
from O(Np) (systolic algorithm) to 0{N- s /p). The price paid is an increased need 
for memory, of order 0(y/p) [15]. 

The standard systolic algorithm is performed on a one-dimensional ring of p 
processors. In order to transfer all positions and forces, p shifts of all N particles 
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are required for each force calculation. Thus the communication complexity is 
O(Np). 

By analogy with the systolic case, we define the vector of input data x = 
(xi, X 2 ,x p ). In this definition the x% are all data initially stored on processor 
Pi. These data are copied to a second vector x that is shifted between the proces¬ 
sors. (Note our assumption here that the blocksize is N/p, i.e. that every particle 
needs its forces updated. We relax this assumption below.) To calculate the total 
forces one has to take into account all possible pairs of ay with ay. To visualize the 
systolic algorithm, we write down a matrix that shows the intermediate states of x 
at each step of the force calculation. This matrix is called the “systolic matrix” S. 
The first line shows x at shift zero, the ?’th line shows x at shift i: 


/I 2 3 4 5 6 7 8 \ 
81234567 
7 8 1 2 3 4 5 6 

6 7 8 1 2 3 4 5 

56781234 
4 5 6 7 8 1 2 3 

3 4 5 6 7 8 1 2 

( 2 3 4 5 6 7 8 1/ 


(40) 


Local computations are done within one column of this matrix. One notices that 
there is some redundancy of vertical pairs: for instance, the pair ( 1 , 2 ) occurs 8 
times. As a result of this redundancy it is not necessary to perform all of the shifts 
in order to get every possible pair of elements of x. In fact it would be sufficient to 
perform just three shifts, so that we get a matrix with just four rows. This matrix 
is called the “hyper-systolic matrix” H and is 


/I 2 3 4 5 6 7 8 \ 
81234567 
7 8 1 2 3 4 5 6 
( 5 6 7 8 1 2 3 4/ 


(41) 


In contrast to the matrix S , which allows us to get all pairs by comparing the 
first row with one of the other rows, not all of the pairs in H have one member 
in the first row; for example, the pair (ay, 2 : 4 ) requires comparing rows 2 and 4. 
Therefore all of the shifted data in the hyper-systolic algorithm must be stored 
on each node in order to compute the forces. This is the reason for the increased 
memory requirements in comparison with the systolic algorithm. 

The hyper-systolic matrix H can be characterized by a vector called the hyper- 
systolic base Ak = ( 0 , ai,a*,) where a,; gives the stride of the ith shift and k is 
the number of shifts that have to be done. In our example of H for p = 8 , A 3 would 
be: 


A 3 = (0,1,1,2). (42) 

To minimize the communication costs, one wants to minimize k, the number of 
shifts. This is a nontrivial problem and optimization techniques have been described 
for achieving this aim [ 21 ]. 
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In the following we show that the complexity of communication is O(n^Jp) [14]. 
The minimum number of pairings required for the total force calculation is p{p — 
l)/2. The number of possible combinations within a column of k + 1 elements is 
( fe 2 1 ) J thus: 


p(p — 1) . /fc + l\ (k + l)k 

2 ) P= 2 T ' 


(43) 


The solution for this inequality for k > 0 is 


k > \jp-\~ ( 44 ) 

which we wanted to prove. However, since we do 0(y/p) shifts and we have to save 
the intermediate data, we need 0 (y/p) more memory on each processor than with 
the standard systolic algorithm. 


5.2. Hyper-systolic matrix with block time steps 

The hyper-systolic algorithm described by Lippert et al. [14, 15] assumes that 
all N(N — l)/2 force pairs are to be computed at each time step. When dealing 
with block time steps, however, only a subset of the full N particles are shifted 
at each time step, and a minimal basis like that of equation (41) is not sufficient 
to compute the required forces. This is because the full data are only stored in 
the first row and data pairs constructed using other rows only contain information 
about the block particles. 

Nevertheless one can construct a different hyper-systolic matrix H that ensures 
complete force calculations. For p = 8 a possible matrix would be: 


/l2345678\ 

8 1 2 3 4 5 6 7 
78123456 
67812345 
5 6 7 8 1 2 3 4 

(4 5678123) 


(45) 


The bold figures represent full data sets and the non-bold figures indicate the block 
data sets. With this matrix of shifts, one can calculate the complete forces. For 
example, for the first block data set, one uses columns 2, 3,4 and 5. The block data 
can be overwritten as in the systolic algorithm. The number of shifts needed for 
this algorithm is 6. First the block data have to be shifted (4 shifts); then they 
have to be shifted back to their home processor to add the forces (1 shift); and 
finally the full data sets have to be updated (1 shift). With the standard systolic 
algorithm one would need 8 shifts. We pay for this advantage with a higher amount 
of memory required: one more full data set has to be saved on each node. 

One can construct matrices H for an arbitrary number of processors p. We call 
the number of total data sets (bold rows) n and the number of block data sets k. 
The number of rows in the matrix is 


k + k = k. 


(46) 
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To get the full force on each block data set, its contents must be compared with 
every full dataset. So each block data set has to meet each full data set during one 
of the shifts. For each block data set there are p — 1 full data sets to be met. Thus 

n x k > p — 1. (47) 

Equation (47) allows us to find all possible combinations of k and k. Table 4 gives 
the complete list for p = 9 processing elements. For all these combinations, one can 
find a possible hyper-systolic matrix that guarantees a complete force calculation. 
The H matrix is constructed by using the following basis vector: 

block data full data 

A k = (0,1,1,?.., 1,1, k,k,...,k). (48) 

'-v-' "-,-' 

k times k —2 times 

The quantity n determines the amount of memory needed, while k determines 
the amount of communication. This is because the number of shifts needed is 
(k + 1) + (k — 1) = k (one shift for each ic-line, one shift for bringing the data to 
their home processors and k — 1 shifts to update the full data sets). 


TABLE 4 


k, k and k for p = 9 

k I k II k 


1 

2 

3 

4 

5 

6 

7 

8 


8 

4 

3 

2 

2 

2 

2 

1 


9 

6 

6 

6 

7 

8 
9 
9 


The combination of values in the second line of Table 4 provides the fastest 
communication. The number of shifts k is minimal as is the redundancy of data k. 
For optimal memory usage, the first line is the best choice which leads actually to a 
systolic force loop. The last line in Table 4 represents the shared memory approach 
where a copy of the whole dataset is kept on each processor. 

We now compute the minimum number of shifts we have to perform with these 
algorithms, nk is approximately p — 1 (equation 47). Substituting into equation 
(46) we get: 

k = --b n. (49) 

K 


The quantity k is the number of shifts that are to be performed. Thus we want to 
minimize it: 


dk 

dn 


P- 1 
k 2 


+ 1 = 0 


(50) 
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and so 


« = \Jv- 1- (51) 

Thus we need VP — 1 full data sets on each processor. Substituting (51) into (49) 
we get 


k = 2\Jp — 1 (52) 

and by virtue of (46), 

R = k = \Jp — 1. (53) 

The R + 1 shifts of the block data can be done with the nonblocking scheme 
described above. The k— 1 update shifts have to be done, when the force calculation 
is completely finished and therefore cannot be moved to the background. 

5.3. Performance 

The performance of this class of algorithm will now be discussed for a blocking 
communication scheme. The calculation time, from section 3, is: 

Ns 

t f = —T f . (54) 

The communication time for one shift is r; + sr c /p (equation 25) assuming that the 
s block particles are distributed equally over the processors, so that s* = s/p. The 
number of shifts that have to be performed is k = k + R (equation 46). Thus the 
total communication time is 

t c = k . (55) 

Thus, using a blocking communication scheme, the total time for one integration 
step is 

f s \ N s 

t = t c + tf = k (77 T -T C J + —Tf- (56) 

In the case of minimal communication cost (equation 52) the communication cost 
is 


t c 


2 y/pn + 2 ~^=t c , 

Vp 


(57) 


where we approximate k by 2 y/p. We get the optimal number of processors that 
minimize the calculation time by setting the first derivative of p with respect to t 
to zero. This leads us to 


Popt T ; - Pl'ptSTc - NsT f = 0. (58) 

Since the hypersystolic codes are aimed at very massive parallel machines, we as¬ 
sume that p is a number of order 100 or larger. The second term then becomes 
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much smaller than the first one and: 


Popt 



(59) 


As we measure in our benchmarks, the mean groupsize s oc TV 1 / 2 . In this casep opt ~ 
7V(t//tz) 2/3 . So the optimal processor number is independent of the communication 
bandwidth of the host computer while this is not the case for the systolic algorithm 
in equation (21). 

The performance just described is similar to that of the two-dimensional ring 
algorithm introdued by Makino [16]. If we choose k = 2 y/p we get the same amount 
of communication time and the same amount of communication as his algorithm 
needs. Makino’s algorithm however is not strictly hyper-systolic. It is designed for 
a 2D network of processors and the systolic shifts are carried out within rows and 
columns of this 2D array. Such an algorithm is far less flexible than the hyper- 
systolic algorithm described above, for the foilwing reasons. First, it only works 
for processor numbers p for which y/p is an integer. Second, Makino’s algorithm 
corresponds to our hyper-systolic algorithm with k = k = y/p. Therefore it requires 
that y/p full data arrays are saved on each processor. Our algorithm works with less 
memory without losing communication efficiency. For example, for p = 9, Makino’s 
algorithm needs to save 3 datasets on each processor, while with the hyper-systolic 
algorithm, we can choose k = 2 and k = 4 (Table 4). k still equals 6. Thus the 
communication complexity is the same as in the case n = k = 3 while the memory 
requirements are reduced by one. Also the number of shifts that cannot be done in 
a non-blocking way (k — 1) is reduced. 

5.4. Small group sizes and the hyper-systolic algorithm 

Above we discussed the effect of small group sizes on systolic algorithms. We 
defined a group to be small if not all processors provide particles to the group. We 
found that the systolic algorithm with nonblocking communication is able to deal 
with small group sizes in such a way that, over a large percentage of time, the force 
calculation is running parallel. 

In a hyper-systolic scheme, the shifts of the block data can be done with blocking 
or nonblocking routines. Thus the same considerations as in section 3.4 apply. In 
addition we encounter a new problem: In the hyper-systolic algorithm the particle 
data are not sent to each processor. For example in the case of p = 8 and a 
hyper-systolic matrix like in equation (45), calculations for particles provided by 
processor 1 are only done on processors 2,3,4 and 5. If we only have one particle 
in a group, and this is provided by processor 1, then only these four processors 
do calculations, while the remaining four processors wait until the next integration 
step. The performance would be comparable to a 4 processor parallel computation. 

In general, if one chooses a hyper-systolic algorithm with the smallest commu¬ 
nication complexity k = y/p — 1 (equation 53) and a group size of one, then only 
y/p — 1 processors are doing the computation. 

If more than one but not all processors contribute particles to the group, the par¬ 
allelization depends strongly on the specific processor number. For example, in the 
case of matrix 45, if processors 1 and 2 provide particles, 5 processors are working 
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in parallel. If processors 1 and 5 provide particles, then all of the 8 processors are 
working in parallel. 


6. AN APPLICATION: GRAVITATIONAL BROWNIAN MOTION 

6.1. The Problem 

The algorithms described here are ideally suited to problems requiring large N 
and small to moderate numbers of integration steps. One such problem is the 
Brownian motion of a massive point particle that responds to the gravitational 
force from N , less-massive particles. Let M, and R be the mass and position of 
the Brownian particle, m the mass of a perturber, and r j the position of the jth 
perturber particle. The equations of motion are 


R = 


N 

k =1 


(R - r fc ) 

|R —r fc | 3 ’ 


N 

fc=i 


( r j 


rfc) 
rfc | 3 


+ GM, 


(R — r j) 

|R-ry| 3 ’ 


(60) 

(61) 


where the summation in equation (61) excludes k = j. The total mass of the stellar 
system excluding the Brownian particle is Nm = M. 

This problem is relevant to the behavior of supermassive black holes at the centers 
of galaxies. Black holes have masses of order 1O 6 M 0 < M. < 10 9 Mq, with M 0 the 
mass of the sun, compared with a total mass of a galactic nucleus of ~ 1O 9 M 0 and 
of an entire galaxy, ~ 10 11 M 0 . By analogy with the fluid case, the rms Brownian 
velocity of the black hole is expected to be of order 



where a is the ID velocity dispersion of stars in the vicinity of the black hole 
and m is a typical stellar mass. But Brownian motion in a self-gravitating system 
is expected to differ from that in a fluid, for a variety of reasons. A massive 
particle alters the potential when it moves and may excite collective modes in the 
background. Its effective mass may differ from M, due to particles bound to it, 
and the force perturbations acting on it are not necessarily localized in time. For 
these reasons, it is important to treat the motion of the background particles in a 
fully self-consistent way, as in equations (60-61). 

Integrations involving large particle numbers are slow even with a parallel agorithm, 
but fortunately the time required for a massive particle (the “black hole”) to reach 
a state of energy equipartition with the stars, T eq , is expected to be less than a 
single crossing time of the stellar system in which it is imbedded. We derive this 
result as follows. Starting from rest, the mean square velocity of the black hole 
should evolve approximately as 


where 


C V 2 > « {AvfM 


(Auf) 


8\/27r G 2 mp In A 


(63) 


3 


a 


(64) 
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(e.g. [22]). Here p is the mass density of stars and In A is the Coulomb logarithm, 
which is of order unity for this case ([23]) and will henceforth be set equal to 
one. The time T eq required for ( V 2 ) to reach its expected equilibrium value of 
~ ( m/M)a 2 is therefore 


rr ~ 171 _ a taz\ 

eq ~ M.(Avp ~ cm.p- [bb) 

This may be written 

^ (§f) (<&r (66> 

where (p) is the mean density of the galaxy within its half-mass radius R \/2 and 
T[j = R 1 / 2 /cr, the crossing time within that radius. 

A typical environment for a supermassive black hole is at the center of a galaxy 
in which the stellar density varies as 


P~ ( P) 



with 1 < 7 < 2 [24]. Thus 


T 

eq 


T d 


Mgal\ 

M. ) 



(67) 


( 68 ) 


To estimate the second factor in parentheses, we assume that the stars that most 
strongly affect the motion of the black hole are within a distance r ss GA/./cr 2 « 
(. M,/M g ai)Ri/2 , the classical radius of influence of the black hole [26]. Then 


e Q 


T) 


D 


f M. 

\M gal 


0.5 


(69) 


where 7 has been set equal to 1.5, roughly the value characterizing the stellar 
density cusp surrounding the Milky Way black hole [25]. Setting M,/M ga i ~ 10 “ 3 
[19], we find T eq ~ T D / 30. While this result is very approximate, it suggests that a 
massive particle will reach energy equipartition with the “stars” in much less than 
one crossing time of the stellar system. Note that this result is independent of the 
masses of the perturbing particles and hence of N. However the predicted value 
of the equilibrium velocity dispersion of the black hole does depend on m/M . (cf. 
equation 62). We evaluate this dependence below via numerical experiments. 


6.2. The Experiments 

We constructed initial conditions by distributing N = 10 6 , equal-mass particles 
representing the stars according to Dehnen’s law (35) with 7 = 1.5, 


p(r) 


3 M a 


8 n r 1 - 5 (?’ + a) 2 ' 5 


(70) 
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t 

FIG. 9. Time evolution of the squared velocity of the massive (“black hole”) particle in 
the Brownian motion experiments. The background stellar system consists of N = 10 6 particles 
of mass m = 10 -6 distributed according to a Dehnen density law, Equation (70). The mass M* of 
the black hole particle increases downward by a factor of 10 between each frame, from M# = 10 —5 
at the top to M # = 10“ 1 at the bottom. The black hole particle is started at rest and rapidly 
comes into energy equipartition with the lighter particles. The amplitude of its random motion 
varies inversely with its mass as expected from energy equipartition arguments. A close binary 
was formed after 0.016 time units in the simulation with the most massive black hole particle, 
requiring extremely small timesteps and forcing the run to stop. 
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and a central point of mass M . representing the black hole. The value chosen for 
the slope of the central density cusp, 7 = 1.5, is similar to the value near the 
center of the Milky Way galaxy [25]. The velocities of the “star” particles were 
selected from the unique isotropic distribution function that reproduces a steady- 
state Dehnen density law in the presence of a central point mass [27]. The initial 
velocity of the black hole particle was set to zero. We carried out integrations for 
five different values of the black hole mass, M . = ( 0 . 00001 , 0 . 0001 , 0 . 001 , 0 . 01 , 0 . 1 ) 
in units where the total mass M in “stars” is unity. The gravitational constant G 
and the Dehnen-model scale length a were also set to one. Integrations were carried 
out until a time of f max = 0 . 1 , compared with a crossing time for the overall system 
of a/a « 1 . In the case of the largest black hole mass tested by us, M, = 0.1, the 
high velocity of stars in the density cusp around the massive particle required the 
use of very small time steps for many of the particles. This run was not continued 
beyond t « 0.016. All runs were carried out using 128 processors on the Cray T3E 
900 supercomputer in Stuttgart. 

Figure 9 shows the time dependence of the squared velocity of the black hole 
particle, V 2 (t), for each of the runs. The motion appears to reach a statistical 
steady state in much less than one crossing time, as expected, and the inverse 
dependence of ( V 2 ) on M, is apparent. There were no discernable changes in the 
spatial or velocity distribution of the “star” particles during these runs outside of 
the region where the black hole wandered. 

Figure 10 shows V rms as a function of M, for each of the runs. V rms was computed 
by averaging V 2 (t) over the full integration interval. From equation (62), and 
setting to = 1/N = 10 -6 , we expect 

Vrms « 1-73 X 10 - 3 aM~ 0 - 5 (71) 

where a is the ID velocity dispersion of stars in the vicinity of the black hole 
particle. However it is not clear what value of a should be used in this formula, 
since the velocity dispersion of the stars is a function of radius in the initial models: 
a increases toward the center to a maximum of ~ 0.42 at a radius of ~ 0.2, then 
drops slowly inward before rising again as a 2 ss 2/5r within the radius of influence 
of the black hole. Using the peak value of a in the black-hole-free Dehnen model, 
a = 0.42, equation (71) predicts 

V rms « 7.27 x 10" 4 M.“ 0 - 5 . (72) 

This line is shown in Figure 10. The measured values of V rm s are consistent with this 
prediction, although there is a hint that the dependence of V rms on M. is slightly 
less steep than M. 5 . This result suggests that gravitational Brownian motion 
may be very similar to its fluid counterpart. In particular, we note that V rms is 
correctly predicted by equation (62) if one uses a value for <7 in that equation that 
is measured well outside of the region of gravitational influence of the black hole. 
This suggests that most of the perturbations leading to the black hole’s motion 
come from distant stars. 

We can use our results to estimate the random velocity of the supermassive 
black hole at the center of the Milky Way galaxy. It has recently become feasible 
to measure the motion of the Milky Way black hole [28, 29], whose mass is M, ss 
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FIG. 10. RMS velocity of the “black hole” particle in the Brownian motion experiments. 


Vrms was computed using the average value of V 2 (t) between t = 0 and t = 0.1, with the exception 
of the point at M# =0.1 for which the integration time was shorter (cf. Figure 9). Dashed line 
is the predicted relation based on energy equipartition arguments (see text), m is the mass of a 
“star” particle. 
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2.5 —3.5 x 10 6 M@ [30, 31, 32], roughly 10 -3 times the mass of the Milky Way bulge 
[19]. The masses of stars in the cluster surrounding the black hole are of order 
10 — 2OM 0 [33]. We adopt a « 100 km s _1 , roughly the peak velocity dispersion 
measured for the stars in the Milky Way bulge outside of the region of influence of 
the black hole [34]; here we make use of our result that the Brownian velocity in 
the iV-body simulations is correctly predicted by equation (62) when cr is replaced 
by its peak value outside of the region of influence of the central black hole. The 
resulting prediction for the 3D random velocity of the Milky Way black hole is 


V„ 


0.40 


\15Mq 


1/2 


M. 


3 x 1O 6 M 0 


- 1/2 


km s 


(73) 


The predicted velocity could be much greater than 0.4 km s -1 if the objects provid¬ 
ing the force perturbations are much more massive than 15M 0 , e.g., giant molecular 
clouds. Current upper limits on the proper motion (2D) velocity of the Milky Way 
black hole are ~ 20 km s -1 [28, 29]. 


7. CONCLUSIONS 

We have introduced two variants of a systolic algorithm for parallelizing direct- 
summation 7V-body codes implementing individual block time step integrators: the 
first with blocking communication, and the second with non-blocking communi¬ 
cation. Performance tests were carried out using /V-body models similar to those 
commonly studied by dynamical astronomers, in which the gravitational forces vary 
widely between core and halo and for which the particle block sizes are typically 
very small. The nonblocking scheme was found to provide far better performance 
than the blocking scheme for such systems, providing a nearly ideal speedup for the 
force calculations. By engaging a sufficient number of computing nodes, particle 
numbers in excess of 10 6 are now feasible for direct fV-body simulations. For par¬ 
allel machines with very large processor numbers, we describe the implementation 
of a hyper-systolic computing scheme which provides a communication scaling of 
0(y/p) at the expense of increased memory demands. 

The codes used to write this paper are available for download at: 
http://www.physics.rutgers.edu/~marchems/download.html 
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